knitr::opts_chunk$set(comment = NA)
library(janitor)
library(naniar)
library(googlesheets4) ## to read in data from shared Google Sheet
library(knitr); library(kableExtra) ## to make prettier tables
library(rstanarm) ## to fit Bayesian linear models
library(MKinfer) ## for bootstrap t tests
library(infer) ## for bootstrapping differences in medians
library(DescTools) ## for post-hoc testing in ANOVA
library(car) ## for boxCox
library(xfun) ## for session information
library(easystats)
library(tidyverse)
theme_set(theme_bw())
source("Love-431.R") ## for lovedist() function

An Update on the Favorite Movies
Addressing Exploratory Questions developed by the 431 Class
1 Questions This Work Addresses
1.1 Data Management Questions
- How do we ingest data from a Google Sheet?
- How do we clean up some of the favorite movies data?
- How might we deal with a “check all that apply” item like imdb_genres from our favorite movies?
1.2 Analytic Questions
- Do movies released in 1940-2005 have more user ratings than movies released after 2005?
- Which MPA categories have higher average IMDB star ratings?
- How strong is the association between the number of IMDB user ratings (imdb_ratings) and the weighted average star rating (imdb_stars)?
- Which movie genres have the highest weighted average star ratings on IMDB?
2 R Setup
3 Ingesting the Data
The Google Sheet containing the movies_2025-09-30 data is found at the following URL:
gs4_deauth()
url_movies <- "https://docs.google.com/spreadsheets/d/1CnbaDAeFoTzvI_V7b3-C0Gy6c9sdNsFWztACLJ4PMAw/edit?usp=drive_link"
movies_raw <- read_sheet(url_movies) |>
  janitor::clean_names()

✔ Reading from "movies_2025-09-30".
✔ Range 'movies_2025-09-30'.

dim(movies_raw)

[1] 260  11
This initial result should (and does) include 11 variables and 260 movies.
3.1 Data Cleaning
3.1.1 Are there any missing values?
n_miss(movies_raw) ## any missing values?

[1] 0
If we had any missing values indicated above, we’d probably want to run miss_var_summary() and miss_case_table() to understand the situation better, but that’s not necessary here.
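If you ever want to see where these counts come from, the same totals can be computed directly in base R. A minimal sketch on a toy data frame (toy_df is invented for illustration, not our movies data):

```r
## Base-R equivalent of naniar::n_miss(): count the NA cells directly.
## toy_df is a made-up example, not the movies data.
toy_df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

n_miss_total <- sum(is.na(toy_df))      ## total missing cells: 2
miss_by_var  <- colSums(is.na(toy_df))  ## per-variable counts, in the spirit of miss_var_summary()
```
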
3.1.2 Restricting to Key Variables and Minor Changes
Here’s a glimpse at our current tibble:
glimpse(movies_raw)

Rows: 260
Columns: 11
$ mov_id <chr> "M-001", "M-002", "M-003", "M-004", "M-005", "M-006", "M…
$ movie <list> "3 Idiots", "8 1/2", "10 Things I Hate About You", "20t…
$ year <dbl> 2009, 1963, 1999, 2016, 2006, 1968, 2009, 2013, 1988, 19…
$ length <dbl> 170, 138, 97, 119, 117, 149, 119, 123, 124, 117, 160, 16…
$ imdb_genres <chr> "Comedy, Drama", "Drama", "Comedy, Drama, Romance", "Com…
$ imdb_ratings <dbl> 467000, 130000, 430000, 52000, 898000, 764000, 60000, 42…
$ imdb_stars <dbl> 8.4, 8.0, 7.4, 7.3, 7.6, 8.3, 7.9, 7.8, 8.0, 8.5, 8.4, 7…
$ mpa <chr> "PG-13", "Not Rated", "PG-13", "R", "R", "G", "TV-PG", "…
$ imdb_synopsis <chr> "Two friends are searching for their long lost companion…
$ list_25 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,…
$ imdb_link <chr> "https://www.imdb.com/title/tt1187043", "https://www.imd…
We’re going to focus on 6 variables in our analyses today, plus the two identifying variables (mov_id and movie). They are year, length, imdb_genres, imdb_ratings, imdb_stars, and mpa. We’ll also:
- rearrange the data a little,
- ignore the variables we’re not using today,
- look at ratings100K, which is the number of user ratings divided by 100,000, so that this is scaled more similarly to our other variables, and
- represent the movie names as a character variable, rather than as a list, which is what R defaulted to when ingesting the data.
movies_1 <- movies_raw |>
select(mov_id, year, length, imdb_ratings, imdb_stars, mpa,
imdb_genres, movie) |>
mutate(ratings100K = imdb_ratings/100000) |>
mutate(movie = as.character(movie))
glimpse(movies_1)

Rows: 260
Columns: 9
$ mov_id <chr> "M-001", "M-002", "M-003", "M-004", "M-005", "M-006", "M-…
$ year <dbl> 2009, 1963, 1999, 2016, 2006, 1968, 2009, 2013, 1988, 197…
$ length <dbl> 170, 138, 97, 119, 117, 149, 119, 123, 124, 117, 160, 162…
$ imdb_ratings <dbl> 467000, 130000, 430000, 52000, 898000, 764000, 60000, 420…
$ imdb_stars <dbl> 8.4, 8.0, 7.4, 7.3, 7.6, 8.3, 7.9, 7.8, 8.0, 8.5, 8.4, 7.…
$ mpa <chr> "PG-13", "Not Rated", "PG-13", "R", "R", "G", "TV-PG", "R…
$ imdb_genres <chr> "Comedy, Drama", "Drama", "Comedy, Drama, Romance", "Come…
$ movie <chr> "3 Idiots", "8 1/2", "10 Things I Hate About You", "20th …
$ ratings100K <dbl> 4.670, 1.300, 4.300, 0.520, 8.980, 7.640, 0.600, 4.200, 2…
3.1.3 Are the identifying variables distinct?
We want to make sure we have a different value of mov_id and movie for each row of our data.
nrow(movies_1)

[1] 260

n_distinct(movies_1$mov_id)

[1] 260

n_distinct(movies_1$movie)

[1] 260

OK. Each of these is 260, which is comforting.
3.1.4 Check Ranges of Quantities
We have four quantitative variables here, so let’s check the range of their values.
movies_1 |> reframe(range(imdb_stars), range(ratings100K),
                    range(length), range(year))

# A tibble: 2 × 4
`range(imdb_stars)` `range(ratings100K)` `range(length)` `range(year)`
<dbl> <dbl> <dbl> <dbl>
1 3.4 0.013 70 1940
2 9.3 31 207 2024
- imdb_stars (weighted average star rating) must be between 1 and 10 stars, and the range we see is (3.4, 9.3). That seems reasonable.
- ratings100K ranges from 0.013 (or 1,300 users) to 31 (or 3,100,000 users) who’ve rated the movie. Fine.
- length of the movie ranges from 70 to 207 minutes. Seems OK.
- year the movie was released covers the period from 1940 to 2024. Also seems reasonable.
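These eyeball checks can also be automated, so the report fails loudly if an impossible value ever sneaks into a future version of the data. A minimal sketch with stopifnot(), shown on toy values standing in for the real columns (the year bounds are my assumption):

```r
## Toy stand-ins for columns of movies_1, for illustration only.
stars <- c(3.4, 7.7, 9.3)
years <- c(1940, 2005, 2024)

## These assertions pass silently when the data are in range,
## and stop with an error if any value is impossible.
stopifnot(all(stars >= 1 & stars <= 10))   ## IMDB stars live on 1-10
stopifnot(all(years >= 1940 & years <= 2025))  ## assumed plausible release years
```
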
3.1.5 mpa is categorical
Let’s assess the levels of mpa rating, and consider how to create a factor representation of that information in R.
movies_1 |> count(mpa)

# A tibble: 10 × 2
mpa n
<chr> <int>
1 Approved 2
2 G 8
3 Not Rated 14
4 PG 67
5 PG-13 84
6 R 80
7 TV-14 1
8 TV-G 2
9 TV-MA 1
10 TV-PG 1
Since only PG, PG-13 and R have fairly large sample sizes, let’s collapse all but the three most common categories together, and then create a four-level factor to represent mpa rating.
Fortunately, the forcats package (part of the tidyverse) has a tool for this task.
movies_1 <- movies_1 |>
  mutate(mpa4 = fct_lump_n(mpa, n = 3, other_level = "Other"))

fct_lump_n() with n = 3 collapses all mpa values that occur less often than the top 3 mpa values.
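For intuition, here is roughly what that lumping does, sketched in base R on a toy vector (this sketch ignores forcats’ tie-handling details):

```r
## Toy ratings vector, invented for illustration.
x <- c("PG", "PG", "PG-13", "PG-13", "PG-13", "R", "R", "G", "TV-14")

## keep the three most common values, relabel the rest as "Other"
top3 <- names(sort(table(x), decreasing = TRUE))[1:3]
x_lumped <- ifelse(x %in% top3, x, "Other")

table(x_lumped)  ## Other: 2, PG: 2, PG-13: 3, R: 2
```
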
Let’s do a little sanity check to ensure that the relationship between mpa and mpa4 matches what we expect.
movies_1 |> tabyl(mpa, mpa4) |>
  adorn_totals(where = "row") |> adorn_title(placement = "combined")

 mpa/mpa4 PG PG-13  R Other
Approved 0 0 0 2
G 0 0 0 8
Not Rated 0 0 0 14
PG 67 0 0 0
PG-13 0 84 0 0
R 0 0 80 0
TV-14 0 0 0 1
TV-G 0 0 0 2
TV-MA 0 0 0 1
TV-PG 0 0 0 1
Total 67 84 80 29
Looks good.
3.1.6 Dealing with imdb_genres: A bigger task
A genre is a category of a work of art defined by a particular set of shared conventions, content, form or style. These categories help organize and classify works based on common characteristics, allowing audiences and critics to understand what to expect from a piece.
As we discussed in Class 09, each value in imdb_genres lists between 1 and 8 of the following 20 types of movie.
- Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, History,
- Horror, Mystery, Music, Musical, Romance, Sci-Fi, Sport, Thriller, War, Western
movies_1 |> count(imdb_genres) |> arrange(desc(n))

# A tibble: 145 × 2
imdb_genres n
<chr> <int>
1 Drama, Romance 13
2 Drama 12
3 Comedy, Drama 11
4 Comedy, Drama, Romance 11
5 Comedy 8
6 Action, Adventure, Fantasy, Sci-Fi 7
7 Action, Adventure, Sci-Fi 7
8 Action, Adventure, Thriller 5
9 Action, Adventure, Sci-Fi, Thriller 4
10 Crime, Drama, Thriller 4
# ℹ 135 more rows
In 260 movies, we see 145 different combinations of genres in the imdb_genres variable. As this table shows, the most common combination, Drama and Romance, appears in 13 movies, while, for example, 12 movies list Drama alone.
3.1.7 Building Indicators for Genres
To represent these data in a more useful way, I suggest building a set of 20 indicator variables (which take the value 1 or 0) to indicate the presence (1) or absence (0) of each of the 20 available genres for that movie.
One way to do this (admittedly, a tedious bit of coding) is to use the str_detect() function from the stringr package (part of the tidyverse) as follows:
movies_1 <- movies_1 |>
mutate(action = as.numeric(str_detect(imdb_genres, fixed("Action"))),
adventure = as.numeric(str_detect(imdb_genres, fixed("Adventure"))),
animation = as.numeric(str_detect(imdb_genres, fixed("Animation"))),
biography = as.numeric(str_detect(imdb_genres, fixed("Biography"))),
comedy = as.numeric(str_detect(imdb_genres, fixed("Comedy"))),
crime = as.numeric(str_detect(imdb_genres, fixed("Crime"))),
drama = as.numeric(str_detect(imdb_genres, fixed("Drama"))),
family = as.numeric(str_detect(imdb_genres, fixed("Family"))),
fantasy = as.numeric(str_detect(imdb_genres, fixed("Fantasy"))),
history = as.numeric(str_detect(imdb_genres, fixed("History"))),
horror = as.numeric(str_detect(imdb_genres, fixed("Horror"))),
mystery = as.numeric(str_detect(imdb_genres, fixed("Mystery"))),
music = as.numeric(str_detect(imdb_genres, fixed("Music"))),
musical = as.numeric(str_detect(imdb_genres, fixed("Musical"))),
romance = as.numeric(str_detect(imdb_genres, fixed("Romance"))),
scifi = as.numeric(str_detect(imdb_genres, fixed("Sci-Fi"))),
sport = as.numeric(str_detect(imdb_genres, fixed("Sport"))),
thriller = as.numeric(str_detect(imdb_genres, fixed("Thriller"))),
war = as.numeric(str_detect(imdb_genres, fixed("War"))),
western = as.numeric(str_detect(imdb_genres, fixed("Western"))),
  )

Let’s use these new variables to obtain counts of the number of films associated with each of the 20 genres. Since each value is either 1 (if that genre is in the movie’s list) or 0, we can sum up the columns to get these counts.
movies_1 |> select(action:war) |> colSums()

   action adventure animation biography    comedy     crime     drama    family
60 87 29 17 103 34 153 44
fantasy history horror mystery music musical romance scifi
60 6 12 26 30 17 62 51
sport thriller war
6 50 11
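Those 20 nearly identical mutate() lines could also be generated in one step. A hedged base-R sketch of the same idea, on a toy genre column with invented values:

```r
## Toy stand-in for imdb_genres; the real code would use movies_1$imdb_genres
## and the full vector of 20 genre names.
genres <- c("Comedy, Drama", "Drama", "Action, Sci-Fi")
wanted <- c("Comedy", "Drama", "Sci-Fi")

## one 0/1 column per genre, via fixed-string matching like str_detect(fixed())
ind <- sapply(wanted, function(g) as.numeric(grepl(g, genres, fixed = TRUE)))

colSums(ind)  ## Comedy: 1, Drama: 2, Sci-Fi: 1
```

One caution with fixed-string matching (in this sketch and in the str_detect() version alike): "Music" is a substring of "Musical", so a fixed match for "Music" also flags musicals. If that distinction matters, a boundary-aware regex such as "\\bMusic\\b" would avoid it.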
Now, we can answer questions like:
How many movies are categorized as both Drama and Mystery?
movies_1 |> count(drama, mystery)

# A tibble: 4 × 3
drama mystery n
<dbl> <dbl> <int>
1 0 0 95
2 0 1 12
3 1 0 139
4 1 1 14
Can we list the 14 movies that are categorized as both Drama and Mystery (and perhaps other things)?
movies_1 |> filter(drama == 1, mystery == 1) |>
  select(movie, imdb_stars, imdb_genres)

# A tibble: 14 × 3
movie imdb_stars imdb_genres
<chr> <dbl> <chr>
1 About Elly (Darbareye Elly) 7.9 Drama, Mystery
2 Blade Runner 2049 8 Action, Drama, Mystery, Sci-Fi, T…
3 Blue Velvet 7.7 Crime, Drama, Mystery, Thriller
4 Cloud Atlas 7.4 Drama, Mystery, Sci-Fi, Thriller
5 Coco 8.4 Animation, Adventure, Drama, Fami…
6 The Conversation 7.7 Drama, Mystery, Thriller
7 Divergent 6.6 Action, Adventure, Drama, Mystery…
8 The Girl with the Dragon Tattoo 7.8 Crime, Drama, Mystery, Thriller
9 The Green Mile 8.6 Crime, Drama, Fantasy, Mystery
10 The Hateful Eight 7.8 Crime, Drama, Mystery, Thriller, …
11 Hereditary 7.3 Drama, Horror, Mystery, Thriller
12 Memento 8.4 Drama, Mystery, Thriller
13 The Prestige 8.5 Drama, Mystery, Sci-Fi, Thriller
14 The Rite 6 Drama, Horror, Mystery, Thriller
Are there any movies that include Comedy, Romance and Thriller?
movies_1 |> count(comedy, romance, thriller)

# A tibble: 8 × 4
comedy romance thriller n
<dbl> <dbl> <dbl> <int>
1 0 0 0 84
2 0 0 1 45
3 0 1 0 26
4 0 1 1 2
5 1 0 0 67
6 1 0 1 2
7 1 1 0 33
8 1 1 1 1
movies_1 |> filter(comedy == 1, romance == 1, thriller == 1) |>
  select(movie, imdb_stars, imdb_genres)

# A tibble: 1 × 3
movie imdb_stars imdb_genres
<chr> <dbl> <chr>
1 Om Shanti Om (Peace Be With You) 6.8 Action, Comedy, Drama, Fantasy, M…
Suppose we wanted to build a tibble of the genre counts. Consider the following approach.
genre_counts <- movies_1 |>
select(action:western) |>
colSums() |>
t() |> as_tibble() |> pivot_longer(action:western) |>
rename(genre = name, movies = value) |>
arrange(desc(movies))
genre_counts

# A tibble: 20 × 2
genre movies
<chr> <dbl>
1 drama 153
2 comedy 103
3 adventure 87
4 romance 62
5 action 60
6 fantasy 60
7 scifi 51
8 thriller 50
9 family 44
10 crime 34
11 music 30
12 animation 29
13 mystery 26
14 biography 17
15 musical 17
16 horror 12
17 war 11
18 history 6
19 sport 6
20 western 2
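An alternative route to the same genre counts, sketched in base R: split each comma-separated string and tabulate the pieces directly. Shown here on toy values; the real version would feed in movies_1$imdb_genres.

```r
## Toy stand-in for the imdb_genres column.
genres <- c("Comedy, Drama", "Drama", "Comedy, Drama, Romance")

## split on ", " and count every genre mention in one pass
pieces <- unlist(strsplit(genres, ", ", fixed = TRUE))
sort(table(pieces), decreasing = TRUE)  ## Drama: 3, Comedy: 2, Romance: 1
```
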
3.2 Numerical Summaries
3.2.1 data_codebook results
Here’s a brief description of each of the variables that we’re now thinking about using in an analysis, leaving out the identifying variables, and the original versions of the imdb_ratings, mpa and imdb_genres variables.
data_codebook(movies_1 |>
                select(-mov_id, -movie, -imdb_ratings, -mpa, -imdb_genres))

select(movies_1, -mov_id, -movie, -imdb_ratings, -mpa, -imdb_genres) (260 rows and 25 variables, 25 shown)
ID | Name | Type | Missings | Values | N
---+-------------+-------------+----------+--------------+------------
1 | year | numeric | 0 (0.0%) | [1940, 2024] | 260
---+-------------+-------------+----------+--------------+------------
2 | length | numeric | 0 (0.0%) | [70, 207] | 260
---+-------------+-------------+----------+--------------+------------
3 | imdb_stars | numeric | 0 (0.0%) | [3.4, 9.3] | 260
---+-------------+-------------+----------+--------------+------------
4 | ratings100K | numeric | 0 (0.0%) | [0.01, 31] | 260
---+-------------+-------------+----------+--------------+------------
5 | mpa4 | categorical | 0 (0.0%) | PG | 67 (25.8%)
| | | | PG-13 | 84 (32.3%)
| | | | R | 80 (30.8%)
| | | | Other | 29 (11.2%)
---+-------------+-------------+----------+--------------+------------
6 | action | numeric | 0 (0.0%) | 0 | 200 (76.9%)
| | | | 1 | 60 (23.1%)
---+-------------+-------------+----------+--------------+------------
7 | adventure | numeric | 0 (0.0%) | 0 | 173 (66.5%)
| | | | 1 | 87 (33.5%)
---+-------------+-------------+----------+--------------+------------
8 | animation | numeric | 0 (0.0%) | 0 | 231 (88.8%)
| | | | 1 | 29 (11.2%)
---+-------------+-------------+----------+--------------+------------
9 | biography | numeric | 0 (0.0%) | 0 | 243 (93.5%)
| | | | 1 | 17 ( 6.5%)
---+-------------+-------------+----------+--------------+------------
10 | comedy | numeric | 0 (0.0%) | 0 | 157 (60.4%)
| | | | 1 | 103 (39.6%)
---+-------------+-------------+----------+--------------+------------
11 | crime | numeric | 0 (0.0%) | 0 | 226 (86.9%)
| | | | 1 | 34 (13.1%)
---+-------------+-------------+----------+--------------+------------
12 | drama | numeric | 0 (0.0%) | 0 | 107 (41.2%)
| | | | 1 | 153 (58.8%)
---+-------------+-------------+----------+--------------+------------
13 | family | numeric | 0 (0.0%) | 0 | 216 (83.1%)
| | | | 1 | 44 (16.9%)
---+-------------+-------------+----------+--------------+------------
14 | fantasy | numeric | 0 (0.0%) | 0 | 200 (76.9%)
| | | | 1 | 60 (23.1%)
---+-------------+-------------+----------+--------------+------------
15 | history | numeric | 0 (0.0%) | 0 | 254 (97.7%)
| | | | 1 | 6 ( 2.3%)
---+-------------+-------------+----------+--------------+------------
16 | horror | numeric | 0 (0.0%) | 0 | 248 (95.4%)
| | | | 1 | 12 ( 4.6%)
---+-------------+-------------+----------+--------------+------------
17 | mystery | numeric | 0 (0.0%) | 0 | 234 (90.0%)
| | | | 1 | 26 (10.0%)
---+-------------+-------------+----------+--------------+------------
18 | music | numeric | 0 (0.0%) | 0 | 230 (88.5%)
| | | | 1 | 30 (11.5%)
---+-------------+-------------+----------+--------------+------------
19 | musical | numeric | 0 (0.0%) | 0 | 243 (93.5%)
| | | | 1 | 17 ( 6.5%)
---+-------------+-------------+----------+--------------+------------
20 | romance | numeric | 0 (0.0%) | 0 | 198 (76.2%)
| | | | 1 | 62 (23.8%)
---+-------------+-------------+----------+--------------+------------
21 | scifi | numeric | 0 (0.0%) | 0 | 209 (80.4%)
| | | | 1 | 51 (19.6%)
---+-------------+-------------+----------+--------------+------------
22 | sport | numeric | 0 (0.0%) | 0 | 254 (97.7%)
| | | | 1 | 6 ( 2.3%)
---+-------------+-------------+----------+--------------+------------
23 | thriller | numeric | 0 (0.0%) | 0 | 210 (80.8%)
| | | | 1 | 50 (19.2%)
---+-------------+-------------+----------+--------------+------------
24 | war | numeric | 0 (0.0%) | 0 | 249 (95.8%)
| | | | 1 | 11 ( 4.2%)
---+-------------+-------------+----------+--------------+------------
25 | western | numeric | 0 (0.0%) | 0 | 258 (99.2%)
| | | | 1 | 2 ( 0.8%)
----------------------------------------------------------------------
3.2.2 lovedist results on quantities
Let’s look at some general numerical summaries for the four quantitative variables (not including the binary indicator variables) we are analyzing.
dat1 <- bind_rows(
movies_1 |> reframe(lovedist(imdb_stars)),
movies_1 |> reframe(lovedist(ratings100K)),
movies_1 |> reframe(lovedist(length)),
movies_1 |> reframe(lovedist(year))
)
dat1 <- dat1 |>
  mutate(var_name = c("imdb_stars", "ratings100K", "length", "year")) |>
relocate(var_name)
kable(dat1, digits = 1)

| var_name | n | miss | mean | sd | med | mad | min | q25 | q75 | max |
|---|---|---|---|---|---|---|---|---|---|---|
| imdb_stars | 260 | 0 | 7.6 | 0.9 | 7.7 | 0.7 | 3.4 | 7.1 | 8.1 | 9.3 |
| ratings100K | 260 | 0 | 5.9 | 5.9 | 3.7 | 4.0 | 0.0 | 1.7 | 8.4 | 31.0 |
| length | 260 | 0 | 122.8 | 24.9 | 118.0 | 23.7 | 70.0 | 103.8 | 136.2 | 207.0 |
| year | 260 | 0 | 2002.7 | 14.8 | 2005.5 | 12.6 | 1940.0 | 1996.0 | 2013.0 | 2024.0 |
4 Question 1
- Do movies released in 1940-2005 have more user ratings than movies released after 2005?
Let’s build an appropriate factor called release to identify the two groups of movies being compared here, then obtain some numeric summaries of ratings100K (hundreds of thousands of user ratings) in each release group.
movies_1 <- movies_1 |>
mutate(release = fct_recode(factor(year > 2005),
After2005 = "TRUE", Older = "FALSE"))
movies_1 |> group_by(release) |> reframe(lovedist(ratings100K)) |>
  kable(digits = 1)

| release | n | miss | mean | sd | med | mad | min | q25 | q75 | max |
|---|---|---|---|---|---|---|---|---|---|---|
| Older | 130 | 0 | 6.3 | 6.2 | 3.7 | 3.8 | 0.1 | 1.8 | 9.3 | 31 |
| After2005 | 130 | 0 | 5.5 | 5.4 | 4.0 | 4.4 | 0.0 | 1.4 | 7.6 | 31 |
We see we have a balanced design here, with 130 movies in each release category. Let’s draw a picture to compare these independent samples.
ggplot(movies_1, aes(x = release, y = ratings100K)) +
geom_violin(aes(fill = release)) +
geom_boxplot(width = 0.2, notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
scale_fill_oi() +
guides(fill = "none") +
labs(title = "Favorite Movies (2020-2025)",
x = "Release Category", y = "Hundreds of Thousands of IMDB User Ratings") +
  coord_flip()

Each distribution looks strongly right-skewed. So we have several available options for our analysis.
4.1 Transform the outcome
In an attempt to reduce the skew, I’d be interested in transforming the original count of user ratings to establish a new outcome. Let’s try that picture with a log transformation.
4.1.1 Logarithmic transformation?
ggplot(movies_1, aes(x = release, y = log(imdb_ratings))) +
geom_violin(aes(fill = release)) +
geom_boxplot(width = 0.2, notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
scale_fill_oi() +
guides(fill = "none") +
labs(title = "Favorite Movies (2020-2025)",
x = "Release Category", y = "Logarithm of the # of IMDB User Ratings") +
  coord_flip()

Hmmmm… this looks a little disappointing. Have we transformed too much? Should we instead consider the square root of the number of IMDB ratings?
4.1.2 Square root transformation?
ggplot(movies_1, aes(x = release, y = sqrt(imdb_ratings))) +
geom_violin(aes(fill = release)) +
geom_boxplot(width = 0.2, notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
scale_fill_oi() +
guides(fill = "none") +
labs(title = "Favorite Movies (2020-2025)",
x = "Release Category", y = "Square Root of # of IMDB User Ratings") +
  coord_flip()

Well, that seems much closer to a pair of distributions that could be reasonably well approximated by the Normal distribution.
4.2 Pooled t-test / linear model on sqrt(imdb_ratings)
So, let’s go ahead and run a linear model to obtain an appropriate two-sample pooled t test comparing the means of sqrt(ratings) across the two release groups.
fit1 <- lm(sqrt(imdb_ratings) ~ release, data = movies_1)
model_parameters(fit1, ci = 0.95, pretty_names = FALSE)

Parameter        | Coefficient |    SE |            95% CI | t(258) |      p
----------------------------------------------------------------------------
(Intercept) | 699.77 | 31.48 | [ 637.78, 761.76] | 22.23 | < .001
releaseAfter2005 | -45.11 | 44.52 | [-132.77, 42.55] | -1.01 | 0.312
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
For a movie released after 2005, our model fit1 estimates that this movie’s square root of number of IMDB user ratings will be, on average, 45.11 smaller than for a movie released in 1940-2005. There’s no concern here about using a pooled t approach (as opposed to the Welch t approach) given the balanced design.
Suppose we assume that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population. When we generalize beyond the movies selected here to the population of interest, and we assume that our model fit1 is correct, then our sample data are compatible (at the 95% confidence level) with differences in the square root of number of IMDB user ratings between 132.77 smaller in the “after 2005” group up to 42.55 larger in the “after 2005” group.
The frustrating point here is that we’re now talking about the square root of the number of ratings, rather than the ratings themselves. Hence, we might want to make some average predictions for each group, back on our original scale.
estimate_means(fit1, by = "release", ci = 0.95, transform = TRUE)

Estimated Marginal Means
release | Mean | 95% CI | df
-------------------------------------------------
Older | 4.90e+05 | [4.07e+05, 5.80e+05] | 258
After2005 | 4.29e+05 | [3.51e+05, 5.14e+05] | 258
Variable predicted: imdb_ratings
Predictors modulated: release
Squaring each model-based mean back-transforms it out of the square root scale, which yields these point estimates:
- 490,000 user ratings on average in the “Older” group with 95% CI (407,000, 580,000)
- and 429,000 user ratings on average in the “After 2005” group with 95% CI (351,000, 514,000).
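As a quick arithmetic check, squaring the fit1 coefficients on the square-root scale reproduces the estimate_means() results:

```r
## coefficients copied from the model_parameters(fit1) output above
mean_older_sqrt <- 699.77          ## intercept: mean sqrt(ratings), Older group
mean_after_sqrt <- 699.77 - 45.11  ## intercept + slope: After2005 group

round(mean_older_sqrt^2)  ## 489678, i.e. roughly 490,000
round(mean_after_sqrt^2)  ## 428580, i.e. roughly 429,000
```
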
4.3 Bayesian linear model?
We could have run a Bayesian model instead here, although it has the same issues with assumptions and the transformation as our fit1 model.
set.seed(43101)
fit1b <- stan_glm(sqrt(imdb_ratings) ~ release,
data = movies_1, refresh = 0 )
model_parameters(fit1b, ci = 0.95, pretty_names = FALSE)

Parameter        |  Median |            95% CI |     pd |  Rhat |  ESS |                     Prior
-------------------------------------------------------------------------------------------------
(Intercept) | 699.21 | [ 637.85, 762.95] | 100% | 1.000 | 4014 | Normal (677.21 +- 897.31)
releaseAfter2005 | -44.41 | [-133.16, 42.54] | 83.50% | 1.000 | 4015 | Normal (0.00 +- 1791.17)
Uncertainty intervals (equal-tailed) computed using a MCMC distribution
approximation.
estimate_means(fit1b, by = "release", ci = 0.95, transform = TRUE)

Estimated Marginal Means
release | Median | 95% CI | pd | ROPE | % in ROPE
------------------------------------------------------------------------------
Older | 4.89e+05 | [4.07e+05, 5.82e+05] | 100% | [-0.10, 0.10] | 0%
After2005 | 4.29e+05 | [3.52e+05, 5.17e+05] | 100% | [-0.10, 0.10] | 0%
Variable predicted: imdb_ratings
Predictors modulated: release
Does this look like a meaningfully different set of results than we obtained with fit1? No, as it probably shouldn’t, given our choice of a weakly informative prior.
4.4 Use the bootstrap
Given the skew in the original data, we might consider applying the bootstrap if we still wanted to compare means. We’ll pool the variance estimates here with var.equal = TRUE, since we have a balanced design.
fit2 <- boot.t.test(imdb_ratings ~ release, data = movies_1,
var.equal = TRUE, conf.level = 0.95)
fit2
Bootstrap Two Sample t-test
data: imdb_ratings by release
number of bootstrap samples: 9999
bootstrap p-value = 0.2732
bootstrap difference of means (SE) = 80545.43 (72265.29)
95 percent bootstrap percentile confidence interval:
-60801.46 223390.85
Results without bootstrap:
t = 1.1079, df = 258, p-value = 0.2689
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-62486.79 223257.56
sample estimates:
mean in group Older mean in group After2005
627142.3 546756.9
For a movie released in 1940-2005 (the “older” period), our bootstrap fit2 estimates that this movie’s number of IMDB user ratings will be, on average, about 80,545 larger than for a movie released after 2005.
Suppose we assume again that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population.
When we generalize beyond the movies selected here to the population of interest, and we assume that our model fit2 is correct, then our sample data are compatible (at the 95% confidence level) with differences in the mean of IMDB user ratings between 60,801 smaller in the “older” group up to 223,391 larger in the “older” group.
This is appealing because it’s still an estimate about the mean of IMDB user ratings, but it’s debatable whether the skew in our data lets us interpret the mean as the “center” of the distribution. We might prefer a bootstrap confidence interval comparing the medians of the two groups.
The sample medians of imdb_ratings within the two groups are:

movies_1 |> group_by(release) |>
  summarize(n = n(), median(imdb_ratings), mean(imdb_ratings))

# A tibble: 2 × 4
release n `median(imdb_ratings)` `mean(imdb_ratings)`
<fct> <int> <dbl> <dbl>
1 Older 130 368000 627142.
2 After2005 130 399500 546757.
So I’ll set this interval up to estimate the “After2005” - “Older” difference in medians, whose sample value (399500 - 368000 = 31500) is positive. Note that the sample means show the opposite sign from this difference in medians.
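Before running the infer pipeline, it may help to see the bootstrap idea by hand. A base-R sketch on invented toy data, resampling within each group and collecting the difference in medians across resamples:

```r
## Toy data; the real analysis uses imdb_ratings split by release group.
set.seed(431)
older <- c(10, 20, 30, 40, 50, 70)
newer <- c(15, 25, 35, 60, 80, 90)

## 2000 bootstrap resamples of the "newer minus older" difference in medians
boot_diffs <- replicate(2000, {
  median(sample(newer, replace = TRUE)) - median(sample(older, replace = TRUE))
})

## a 95% percentile interval, analogous to get_ci(type = "percentile")
quantile(boot_diffs, c(0.025, 0.975))
```
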
set.seed(431)
sample_diff_in_meds <- movies_1 |>
specify(imdb_ratings ~ release) |>
calculate(stat = "diff in medians",
order = c("After2005", "Older") )
fit3 <- movies_1 |>
specify(imdb_ratings ~ release) |>
generate(reps = 2000, type = "bootstrap") |>
calculate(stat = "diff in medians",
order = c("After2005", "Older") ) |>
get_ci(level = 0.95, type = "percentile")
fit3 <- fit3 |>
mutate(point_est = sample_diff_in_meds) |>
relocate(point_est)
fit3

# A tibble: 1 × 3
point_est$stat lower_ci upper_ci
<dbl> <dbl> <dbl>
1 31500 -111025 187012.
Across movies released after 2005, our bootstrap fit3 estimates that the median number of IMDB user ratings will be 31,500 larger than the median for movies released between 1940 and 2005 (the “older” period).
Suppose we assume once again that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population.
When we generalize beyond the movies selected here to the population of interest, and we assume that our model fit3 is correct, then our sample data are compatible (at the 95% confidence level) with differences in the median of IMDB user ratings between 111,025 smaller in the “After2005” group up to 187,012 larger in the “After2005” group.
If we needed it, we could estimate a p value here with:
movies_1 |>
specify(imdb_ratings ~ release) |>
generate(reps = 2000, type = "bootstrap") |>
calculate(stat = "diff in medians",
order = c("After2005", "Older") ) |>
  get_pvalue(obs_stat = sample_diff_in_meds, direction = "both")

# A tibble: 1 × 1
p_value
<dbl>
1 0.985
4.5 Use a non-parametric Wilcoxon rank sum test
Yet another approach we might consider here is a Wilcoxon rank sum test. This test compares locations without using either the mean or the median of the original values: it first ranks the observed imdb_ratings values regardless of release group, and then compares the centers (locations) of the rank distributions across the two groups.
wilcox.test(imdb_ratings ~ release, data = movies_1,
exact = FALSE, conf.int = TRUE, conf.level = 0.95)
Wilcoxon rank sum test with continuity correction
data: imdb_ratings by release
W = 8927.5, p-value = 0.4314
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-53000 123000
sample estimates:
difference in location
31000
Of course, these results don’t describe either the difference in means or the difference in medians of the release groups on our outcome (imdb_ratings), so they’re meaningfully harder to think about, unless you’re only concerned with the p value, which is also pretty large, as we saw with the other methods shown above.
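To see where the W statistic comes from, here is a tiny worked example: rank all observations jointly, sum the ranks in the first group, and subtract n1(n1+1)/2.

```r
## Toy data, invented for illustration.
g1 <- c(1.2, 3.4, 5.6)
g2 <- c(2.3, 4.5)

r <- rank(c(g1, g2))                ## joint ranks: g1 gets 1, 3, 5; g2 gets 2, 4
W <- sum(r[1:3]) - 3 * (3 + 1) / 2  ## 9 - 6 = 3

unname(wilcox.test(g1, g2)$statistic)  ## also 3
```
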
The table below shows point estimates and confidence intervals for the After2005 - Older difference in the statistic being compared:
| Method | Point Estimate | 95% CI | p value |
|---|---|---|---|
| Compare means of square root of imdb_ratings | -45.11 | (-133, 43) | 0.312 |
| Bootstrap t test on means of imdb_ratings with equal variances assumed | -80545 | (-223391, 60801) | 0.273 |
| Pooled t test on means of imdb_ratings | -80385 | (-223258, 62487) | 0.269 |
| Bootstrap on medians of imdb_ratings | 31500 | (-111025, 187012) | 0.985 |
| Wilcoxon rank sum test comparing locations of imdb_ratings | 31000 | (-53000, 123000) | 0.431 |
4.6 Answering Question 1
We have some conflicting results. Comparing sample means suggests that the older movies have slightly higher numbers of ratings, while comparing sample medians suggests that the movies released after 2005 have a few more ratings. In either case, though, the modeled difference between the older and more recent movies appears to be small relative to the variation in our data.
5 Question 2
- Which MPA categories have higher average IMDB star ratings? (mpa4 and imdb_stars)
We’ll restrict ourselves to our 4-category description of the mpa categories, and here’s a numerical summary of imdb_stars within each of those categories.
movies_1 |> group_by(mpa4) |> reframe(lovedist(imdb_stars)) |>
  kable(digits = 2)

| mpa4 | n | miss | mean | sd | med | mad | min | q25 | q75 | max |
|---|---|---|---|---|---|---|---|---|---|---|
| PG | 67 | 0 | 7.57 | 0.66 | 7.70 | 0.59 | 5.8 | 7.20 | 8.0 | 8.7 |
| PG-13 | 84 | 0 | 7.42 | 0.85 | 7.50 | 0.89 | 4.6 | 6.77 | 8.0 | 9.1 |
| R | 80 | 0 | 7.79 | 0.84 | 7.95 | 0.52 | 3.6 | 7.47 | 8.3 | 9.3 |
| Other | 29 | 0 | 7.35 | 1.14 | 7.70 | 0.59 | 3.4 | 7.10 | 8.0 | 8.6 |
So we don’t have a balanced design here. Let’s draw a comparison boxplot.
ggplot(movies_1, aes(x = mpa4, y = imdb_stars)) +
geom_violin(aes(fill = mpa4)) +
geom_boxplot(width = 0.2, notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
scale_fill_viridis_d(alpha = 0.4) +
guides(fill = "none") +
labs(title = "Favorite Movies (2020-2025)",
       x = "MPA Category", y = "IMDB Star Rating")

5.1 Analysis of Variance model
Let’s run an ANOVA model to compare the mean star ratings in each of these four mpa4 categories. Notice that the anova() and aov() functions provide some of the same information, just arranged differently.
fit4 <- lm(imdb_stars ~ mpa4, data = movies_1)
anova(fit4)

Analysis of Variance Table
Response: imdb_stars
Df Sum Sq Mean Sq F value Pr(>F)
mpa4 3 7.16 2.38674 3.365 0.01925 *
Residuals 256 181.57 0.70927
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov(fit4)

Call:
aov(formula = fit4)
Terms:
mpa4 Residuals
Sum of Squares 7.16023 181.57424
Deg. of Freedom 3 256
Residual standard error: 0.8421843
Estimated effects may be unbalanced
eta_squared(fit4)

For one-way between subjects designs, partial eta squared is equivalent
to eta squared. Returning eta squared.
# Effect Size for ANOVA
Parameter | Eta2 | 95% CI
-------------------------------
mpa4 | 0.04 | [0.00, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
Model fit4 estimates that the proportion of variation in imdb_stars explained by the mpa4 rating category is only about 4%, which doesn’t sound like much, even though the ANOVA F test has a relatively small p value.
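As a quick check, \(\eta^2\) is just the mpa4 sum of squares divided by the total sum of squares; using the values from the aov() output above:

```r
## eta-squared = SS(between) / SS(total), using the values in the ANOVA table
ss_mpa4  <- 7.16023    # sum of squares for mpa4
ss_resid <- 181.57424  # residual sum of squares

eta_sq <- ss_mpa4 / (ss_mpa4 + ss_resid)
round(eta_sq, 3)  # 0.038, matching eta_squared(fit4)
```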
5.1.1 Model Equation yields the Sample Means
In light of these results, let’s dig a little more deeply into this ANOVA model.
model_parameters(fit4, ci = 0.95, pretty_names = FALSE)

Parameter | Coefficient | SE | 95% CI | t(256) | p
------------------------------------------------------------------
(Intercept) | 7.57 | 0.10 | [ 7.37, 7.77] | 73.58 | < .001
mpa4PG-13 | -0.15 | 0.14 | [-0.42, 0.12] | -1.10 | 0.274
mpa4R | 0.22 | 0.14 | [-0.05, 0.49] | 1.58 | 0.116
mpa4Other | -0.22 | 0.19 | [-0.59, 0.15] | -1.17 | 0.244
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
estimate_means(fit4, ci = 0.95, by = "mpa4")

Estimated Marginal Means
mpa4 | Mean | SE | 95% CI | t(256)
-------------------------------------------
PG | 7.57 | 0.10 | [7.37, 7.77] | 73.58
PG-13 | 7.42 | 0.09 | [7.24, 7.60] | 80.74
R | 7.79 | 0.09 | [7.60, 7.98] | 82.73
Other | 7.35 | 0.16 | [7.04, 7.66] | 47.01
Variable predicted: imdb_stars
Predictors modulated: mpa4
We see that the model reproduces the sample means of the four groups, as it should, and that R has the highest sample mean while Other has the lowest across these four groups.
5.1.2 Contrasts
If we’re not concerned about the multiple comparisons problem, we can run contrasts to compare all pairs of means, like this.
estimate_contrasts(fit4, ci = 0.95, contrast = "mpa4")

Marginal Contrasts Analysis
Level1 | Level2 | Difference | SE | 95% CI | t(256) | p
---------------------------------------------------------------------
PG-13 | PG | -0.15 | 0.14 | [-0.42, 0.12] | -1.10 | 0.274
R | PG | 0.22 | 0.14 | [-0.05, 0.49] | 1.58 | 0.116
Other | PG | -0.22 | 0.19 | [-0.59, 0.15] | -1.17 | 0.244
R | PG-13 | 0.37 | 0.13 | [ 0.11, 0.63] | 2.82 | 0.005
Other | PG-13 | -0.07 | 0.18 | [-0.42, 0.29] | -0.37 | 0.711
Other | R | -0.44 | 0.18 | [-0.80, -0.08] | -2.40 | 0.017
Variable predicted: imdb_stars
Predictors contrasted: mpa4
p-values are uncorrected.
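Since those p-values are uncorrected, we could Bonferroni-adjust them ourselves with p.adjust(); a sketch using the six p-values from the table above:

```r
## uncorrected p-values for the six pairwise contrasts shown above
p_raw <- c("PG-13 - PG" = 0.274, "R - PG" = 0.116, "Other - PG" = 0.244,
           "R - PG-13" = 0.005, "Other - PG-13" = 0.711, "Other - R" = 0.017)

## Bonferroni multiplies each p by the number of comparisons (6), capping at 1
p.adjust(p_raw, method = "bonferroni")
```

After adjustment, only the R vs. PG-13 contrast stays below 0.05, which anticipates the post-hoc Bonferroni results shown next.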
5.1.3 Model Performance
model_performance(fit4)

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------
654.5 | 654.7 | 672.3 | 0.038 | 0.027 | 0.836 | 0.842
As mentioned previously, the \(\eta^2\) (“eta-squared” or, more generally, R-squared) in this ANOVA model is only 3.8%, and we can see that the \(\sigma\) value is 0.842, implying that about 68% of this model’s predicted imdb_stars results should be within 0.842 of the actual imdb_stars value, and about 95% of the predictions should be within (0.842 x 2) = 1.684 stars. That also doesn’t sound great, in light of the fact that 75% of the movies have star ratings between 7.1 and 9.3, for example.
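The “68% within \(\sigma\), 95% within \(2\sigma\)” heuristic is easy to check empirically. Here is a sketch on simulated data (the data are made up; only the coverage idea matters, with the noise sd chosen to echo fit4’s \(\sigma\)):

```r
## empirical check of the "68% within 1 sigma, 95% within 2 sigma" heuristic
set.seed(431)
sim <- data.frame(x = runif(1000))
sim$y <- 7 + 0.1 * sim$x + rnorm(1000, sd = 0.84)  # toy data, noise sd ~ fit4's sigma

fit_sim <- lm(y ~ x, data = sim)
s <- sigma(fit_sim)

mean(abs(resid(fit_sim)) <= s)      # close to 0.68
mean(abs(resid(fit_sim)) <= 2 * s)  # close to 0.95
```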
movies_1 |> reframe(lovedist(imdb_stars)) |> kable(digits = 2)

| n | miss | mean | sd | med | mad | min | q25 | q75 | max |
|---|---|---|---|---|---|---|---|---|---|
| 260 | 0 | 7.56 | 0.85 | 7.7 | 0.74 | 3.4 | 7.1 | 8.1 | 9.3 |
5.2 Post-Hoc Bonferroni comparisons
Here are the family-wise 95% Bonferroni confidence intervals for each pairwise difference of means.
PostHocTest(aov(fit4), method = "bonferroni", conf.level = 0.95)
Posthoc multiple comparisons of means : Bonferroni
95% family-wise confidence level
$mpa4
diff lwr.ci upr.ci pval
PG-13-PG -0.15110163 -0.5178966 0.21569331 1.0000
R-PG 0.21985075 -0.1509906 0.59069212 0.6971
Other-PG -0.21842512 -0.7161750 0.27932472 1.0000
R-PG-13 0.37095238 0.0211287 0.72077606 0.0311 *
Other-PG-13 -0.06732348 -0.5496182 0.41497122 1.0000
Other-R -0.43827586 -0.9236551 0.04710335 0.1024
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here’s a plot of those confidence intervals.
par(mar=c(3,6,3,3), las = 1)
plot(PostHocTest(aov(fit4), method = "bonferroni", conf.level = 0.95))

It appears that the comparison of R vs. PG-13 is the only one with a confidence interval here that doesn’t include 0.
5.3 Post-Hoc Tukey HSD comparisons
A Tukey HSD approach is hard to justify here, since the design isn’t even close to being balanced. We can still run it, but I would likely use the Bonferroni intervals here.
PostHocTest(aov(fit4), method = "hsd", conf.level = 0.95)
Posthoc multiple comparisons of means : Tukey HSD
95% family-wise confidence level
$mpa4
diff lwr.ci upr.ci pval
PG-13-PG -0.15110163 -0.5078357 0.20563240 0.6927
R-PG 0.21985075 -0.1408187 0.58052022 0.3940
Other-PG -0.21842512 -0.7025221 0.26567182 0.6483
R-PG-13 0.37095238 0.0307241 0.71118066 0.0265 *
Other-PG-13 -0.06732348 -0.5363892 0.40174224 0.9825
Other-R -0.43827586 -0.9103415 0.03378976 0.0795 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mar=c(3,6,3,3), las = 1)
plot(PostHocTest(aov(fit4), method = "hsd", conf.level = 0.95))

Based on the Tukey HSD approach, it again appears that the comparison of R vs. PG-13 is the only one with a confidence interval here that doesn’t include 0.
5.4 Assessing ANOVA assumptions
The ANOVA assumptions are the same as any linear model, so we can run that complete set of diagnostic plots and see what’s happening.
check_model(fit4, detrend = FALSE)

- The posterior predictive check isn’t terrible, but our model is predicting fewer results in the 8 to 8.5 star range than we see in the data, and more results above 9.0.
- The linearity plot shows no issues with this one-factor ANOVA.
- The homogeneity of variance plot suggests a potential issue with the group having fitted values (remember this is just a sample mean) below 7.4 (so that’s the “Other” group) showing a bit more variation than the other groups. We saw this earlier in the sample standard deviations in each group (and remember that “Other” also has many fewer movies than the other three groups.)
- There are no highly influential observations affecting this ANOVA model.
- The Normality of the ANOVA residuals looks like our biggest problem, with some left skew, especially shown by the dip below the line on the left of the plot.
On the whole, though, I’d be inclined to stick with this model for now. The fundamental conclusion here is that the mpa4 groups, by themselves, do a poor job of predicting imdb_stars.
5.5 Bayesian model?
We could have run a Bayesian model instead here, although it has the same issues with assumptions as our fit4 model.
set.seed(43104)
fit4b <- stan_glm(imdb_stars ~ mpa4,
data = movies_1, refresh = 0)
model_parameters(fit4b, ci = 0.95, pretty_names = FALSE)

Parameter | Median | 95% CI | pd | Rhat | ESS | Prior
------------------------------------------------------------------------------------
(Intercept) | 7.57 | [ 7.37, 7.77] | 100% | 1.001 | 2727 | Normal (7.56 +- 2.13)
mpa4PG-13 | -0.15 | [-0.42, 0.11] | 87.02% | 1.000 | 3086 | Normal (0.00 +- 4.55)
mpa4R | 0.22 | [-0.06, 0.50] | 94.62% | 1.002 | 2980 | Normal (0.00 +- 4.61)
mpa4Other | -0.22 | [-0.59, 0.14] | 87.33% | 1.000 | 3208 | Normal (0.00 +- 6.77)
Uncertainty intervals (equal-tailed) computed using a MCMC distribution
approximation.
Does this look like a meaningfully different set of results than we obtained with fit4? No, and we don’t (yet) have an easy way to even get an ANOVA table in this case.
5.6 Answering Question 2
Answering the question for our sample, though, it looks like the R-rated movies have slightly higher mean imdb_stars than do the PG-13 movies, using a 95% Bonferroni comparison, although the model is not strong at all.
6 Question 3
How strong is the association between the number of user ratings, expressed in hundreds of thousands (ratings100K), and the weighted average star rating (imdb_stars)?
dat2 <- bind_rows(
movies_1 |> reframe(lovedist(imdb_stars)),
movies_1 |> reframe(lovedist(ratings100K))
)
dat2 <- dat2 |>
  mutate(var_name = c("imdb_stars", "ratings100K")) |>
  relocate(var_name)

kable(dat2, digits = 1)

| var_name | n | miss | mean | sd | med | mad | min | q25 | q75 | max |
|---|---|---|---|---|---|---|---|---|---|---|
| imdb_stars | 260 | 0 | 7.6 | 0.9 | 7.7 | 0.7 | 3.4 | 7.1 | 8.1 | 9.3 |
| ratings100K | 260 | 0 | 5.9 | 5.9 | 3.7 | 4.0 | 0.0 | 1.7 | 8.4 | 31.0 |
Let’s look at imdb_stars (on the y axis) and ratings100K (hundreds of thousands of IMDB user ratings) on the x axis.
ggplot(movies_1, aes(x = ratings100K, y = imdb_stars)) +
geom_point() +
geom_smooth(method = "loess",
formula = y ~ x, se = FALSE, col = "blue") +
geom_smooth(method = "lm",
formula = y ~ x, se = TRUE, col = "red") +
labs(x = "Hundreds of Thousands of IMDB ratings",
y = "Weighted average star rating")

The linear model (in red) looks like it might work pretty well, once we get between 500,000 and 2,000,000 IMDB ratings, but isn’t doing as well at the low and high ends of that variable.
6.1 Pearson correlation
The Pearson correlation coefficient might also help.
movies_1 |> select(ratings100K, imdb_stars) |> correlation()

# Correlation Matrix (pearson-method)
Parameter1 | Parameter2 | r | 95% CI | t(258) | p
-------------------------------------------------------------------
ratings100K | imdb_stars | 0.62 | [0.54, 0.69] | 12.61 | < .001***
p-value adjustment method: Holm (1979)
Observations: 260
It looks like the Pearson correlation is 0.62 in our sample.
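In a simple regression, \(R^2\) is just the square of this Pearson correlation, so we can anticipate the model’s explanatory power directly:

```r
## for simple regression, R-squared = r^2
r <- 0.62       # Pearson correlation reported above
round(r^2, 3)   # 0.384, in line with the R2 of about 0.38 for a simple model
```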
6.2 Simple Regression Model and its Equation
The reason I am focusing on ratings100K instead of imdb_ratings is as follows:
fit5 <- lm(imdb_stars ~ imdb_ratings, data = movies_1)
model_parameters(fit5, ci = 0.95, pretty_names = FALSE)

Parameter | Coefficient | SE | 95% CI | t(258) | p
----------------------------------------------------------------------
(Intercept) | 7.04 | 0.06 | [6.92, 7.15] | 118.94 | < .001
imdb_ratings | 9.01e-07 | 7.14e-08 | [0.00, 0.00] | 12.61 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
It is very hard to conceptualize \(9.01 \times 10^{−7}\) in any practical context, plus the 95% CI for the slope of imdb_ratings displays as [0.00, 0.00] after rounding. That’s not helping me understand what’s going on.
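Rescaling a predictor changes nothing about the fit except the units of its slope; a sketch on toy data (made-up numbers, not the movies) shows the slope simply scales by the rescaling factor:

```r
## rescaling x by 1e5 multiplies the fitted slope by exactly 1e5
set.seed(1)
toy <- data.frame(ratings = runif(100, 1e4, 3e6))           # toy rating counts
toy$stars <- 7 + 9e-7 * toy$ratings + rnorm(100, sd = 0.5)  # toy star ratings

b_raw    <- coef(lm(stars ~ ratings, data = toy))[["ratings"]]
b_scaled <- coef(lm(stars ~ I(ratings / 1e5), data = toy))[[2]]

b_scaled / b_raw  # 1e5: the slope per 100,000 ratings is 100,000 times larger
```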
fit6 <- lm(imdb_stars ~ ratings100K, data = movies_1)
model_parameters(fit6, ci = 0.95, pretty_names = FALSE)

Parameter | Coefficient | SE | 95% CI | t(258) | p
---------------------------------------------------------------------
(Intercept) | 7.04 | 0.06 | [6.92, 7.15] | 118.94 | < .001
ratings100K | 0.09 | 7.14e-03 | [0.08, 0.10] | 12.61 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
Actually, I’d like to get another decimal place here.
model_parameters(fit6, ci = 0.95, pretty_names = FALSE, digits = 3)

Parameter | Coefficient | SE | 95% CI | t(258) | p
---------------------------------------------------------------------
(Intercept) | 7.036 | 0.059 | [6.919, 7.152] | 118.943 | < .001
ratings100K | 0.090 | 0.007 | [0.076, 0.104] | 12.607 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
This is a little more interpretable. If we have two movies, one of which has 100,000 more IMDB user ratings than the other, then this model fit6 estimates that the movie with more ratings will have a weighted average star rating (imdb_stars) that is 0.09 higher than the less-rated movie.
Suppose again we assume that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population.
When we generalize beyond the movies selected here to this population of interest, then our sample data are compatible (at the 95% confidence level) with slopes for the relationship between imdb_stars and ratings100K ranging from 0.076 to 0.104, assuming our fit6 model is correct.
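Plugging into the fit6 equation, predicted imdb_stars = 7.036 + 0.090 × ratings100K, so for example:

```r
## predictions from the fit6 equation: stars-hat = 7.036 + 0.090 * ratings100K
pred_stars <- function(ratings100K) 7.036 + 0.090 * ratings100K

pred_stars(5)                    # 500,000 ratings: about 7.49 predicted stars
pred_stars(20) - pred_stars(5)   # 1,500,000 more ratings: +1.35 predicted stars
```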
6.3 Model Performance
Let’s look at the model’s performance.
model_performance(fit6)

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------
535.8 | 535.9 | 546.4 | 0.381 | 0.379 | 0.670 | 0.673
- The \(R^2\) tells us that the fit6 model accounts for 38.1% of the variation in imdb_stars using ratings100K as a predictor.
- We also see \(\sigma = 0.673\). This suggests that about 68% of our predictions of imdb_stars will be within \(\pm 0.673\) of the observed imdb_stars, and that about 95% will be within \(\pm 1.346\). This is better than we did with our ANOVA model in Question 2 (looking at mpa4) but still not terrific, given that 75% of our movies fall in the range of 7.1 to 9.3 stars.
6.4 Assumption Checking
Let’s check our assumptions with diagnostic plots.
check_model(fit6, detrend = FALSE)

- The posterior predictive check is a bit off. This fit6 model is predicting fewer results in the 7.5 to 8.5 star range than we see in the data, and more results below 7.5 and above 8.75, it seems.
- The linearity plot shows a bit of a curve in the reference line.
- The homogeneity of variance plot suggests a fairly substantial issue as well, as the line is curved, rather than flat and horizontal.
- There are no highly influential observations affecting this regression model.
- The Normality of the residuals also indicates some fairly substantial left skew.
6.5 Bayesian linear model?
Do we obtain meaningfully different results from a Bayesian version of this model, with a (default) weakly informative prior?
fit6b <- stan_glm(imdb_stars ~ ratings100K,
data = movies_1, refresh = 0)
model_parameters(fit6b, ci = 0.95, pretty_names = FALSE, digits = 3)

Parameter | Median | 95% CI | pd | Rhat | ESS | Prior
-----------------------------------------------------------------------------------
(Intercept) | 7.036 | [6.922, 7.151] | 100% | 1.000 | 4326 | Normal (7.56 +- 2.13)
ratings100K | 0.090 | [0.076, 0.104] | 100% | 0.999 | 3987 | Normal (0.00 +- 0.36)
Uncertainty intervals (equal-tailed) computed using a MCMC distribution
approximation.
model_performance(fit6b)

# Indices of model performance
ELPD | ELPD_SE | LOOIC | LOOIC_SE | WAIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------------------------------
-269.370 | 24.030 | 538.7 | 48.060 | 538.7 | 0.380 | 0.376 | 0.670 | 0.675
check_model(fit6b, detrend = FALSE)

The fit6 and fit6b models are pretty similar in terms of estimated parameters, performance metrics and diagnostic checks. I’m going to conclude that the Bayesian model here looks pretty much like our ordinary least squares model, so I’ll move on.
6.6 Considering a Transformation
Could a transformation of our outcome be helpful here?
boxCox(fit6)

The Box-Cox plot is a little cut off on the right side. Can we fix that?
boxCox(fit6, lambda = seq(-2, 10, 1/5))

This seems to suggest that we take the imdb_stars value to about the 5th power. That seems ridiculous: in practical work, I would never use a power higher than 3 on an outcome.
Suppose we instead just cube the imdb_stars. How’s that scatterplot look?
ggplot(movies_1, aes(x = ratings100K, y = (imdb_stars^3))) +
geom_point() +
geom_smooth(method = "loess",
formula = y ~ x, se = FALSE, col = "blue") +
geom_smooth(method = "lm",
formula = y ~ x, se = TRUE, col = "red") +
labs(x = "Hundreds of Thousands of IMDB ratings",
y = "Cube of Weighted average star rating")

It’s not obvious to me that this is going to be a meaningful improvement, but we can check it.
6.7 New model for transformed outcome
Here’s the new model:
fit7 <- lm(imdb_stars^3 ~ ratings100K, data = movies_1)
model_parameters(fit7)

Parameter | Coefficient | SE | 95% CI | t(258) | p
---------------------------------------------------------------------
(Intercept) | 356.20 | 8.27 | [339.92, 372.48] | 43.09 | < .001
ratings100K | 15.73 | 1.00 | [ 13.76, 17.69] | 15.75 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
We might consider summaries of its performance (remember, though, we changed the outcome to the cube of imdb_stars, so these results, like the estimated coefficients, \(R^2\) and \(\sigma\), are NOT comparable to the values we got for model fit6.)
model_performance(fit7)

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
-----------------------------------------------------------
436.4 | 436.5 | 447.1 | 0.490 | 0.488 | 93.650 | 94.012
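One way to put fit7 back on a comparable footing: its predictions live on the cubed scale, so we would take the cube root before computing errors on the stars scale. A sketch of the idea on toy data (made-up numbers, not the movies):

```r
## back-transforming predictions from a cubed-outcome model to the original scale
set.seed(2)
toy <- data.frame(x = runif(200, 0, 30))          # toy predictor
toy$y <- 7 + 0.09 * toy$x + rnorm(200, sd = 0.6)  # toy outcome on original scale

fit_cubed <- lm(I(y^3) ~ x, data = toy)  # model the cubed outcome, as in fit7

pred_y <- fitted(fit_cubed)^(1/3)        # cube root to return to y's scale
sqrt(mean((toy$y - pred_y)^2))           # RMSE on the original scale
```

Only after this back-transformation would an RMSE for fit7 be directly comparable to fit6’s.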
Perhaps the most important thing to do here is see if the transformation meaningfully improved our adherence to the assumptions of a linear model. Let’s focus on that.
check_model(fit7, detrend = FALSE)

If anything, this looks meaningfully worse, in terms of the posterior predictive check, so it doesn’t look like a simple transformation of the outcome is going to fix our problem.
6.8 What if we transformed both Y and X?
See Section 11.6 of our Course Book for another example of a log-log model like this, and its interpretation.
ggplot(movies_1, aes(x = log(ratings100K), y = log(imdb_stars))) +
geom_point() +
geom_smooth(method = "loess",
formula = y ~ x, se = FALSE, col = "blue") +
geom_smooth(method = "lm",
formula = y ~ x, se = TRUE, col = "red") +
labs(x = "Logarithm of Hundreds of Thousands of IMDB ratings",
y = "Logarithm of Weighted average star rating")

Does this log-log pair of transformations fare any better, in terms of assumptions?
fit8 <- lm(log(imdb_stars) ~ log(ratings100K), data = movies_1)
model_parameters(fit8, ci = 0.95, pretty_names = FALSE)

Parameter | Coefficient | SE | 95% CI | t(258) | p
--------------------------------------------------------------------------
(Intercept) | 1.95 | 8.59e-03 | [1.93, 1.96] | 226.50 | < .001
log(ratings100K) | 0.06 | 4.92e-03 | [0.05, 0.07] | 12.21 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
The model has a log-transformed response variable. Consider using
`exponentiate = TRUE` to interpret coefficients as ratios.
model_performance(fit8)

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------
603.0 | 603.1 | 613.6 | 0.366 | 0.364 | 0.102 | 0.102
check_model(fit8, detrend = FALSE)

Nope. This is worse than what we started with.
6.9 Answering Question 3
Our regression model fit6 without any transformations explained about 38.1% of the variation in imdb_stars using ratings100K as a predictor. That looks to be our best option right now to answer our question 3.
7 Question 4
- Which movie genres have the highest weighted average star ratings on IMDB?
7.1 Creating a New Data Set
Some of my least interesting programming follows.
row01 <- movies_1 |>
filter(action == 1) |>
summarise(genre = "action",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row02 <- movies_1 |>
filter(adventure == 1) |>
summarise(genre = "adventure",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row03 <- movies_1 |>
filter(animation == 1) |>
summarise(genre = "animation",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row04 <- movies_1 |>
filter(biography == 1) |>
summarise(genre = "biography",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row05 <- movies_1 |>
filter(comedy == 1) |>
summarise(genre = "comedy",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row06 <- movies_1 |>
filter(crime == 1) |>
summarise(genre = "crime",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row07 <- movies_1 |>
filter(drama == 1) |>
summarise(genre = "drama",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row08 <- movies_1 |>
filter(family == 1) |>
summarise(genre = "family",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row09 <- movies_1 |>
filter(fantasy == 1) |>
summarise(genre = "fantasy",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row10 <- movies_1 |>
filter(history == 1) |>
summarise(genre = "history",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row11 <- movies_1 |>
filter(horror == 1) |>
summarise(genre = "horror",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row12 <- movies_1 |>
filter(mystery == 1) |>
summarise(genre = "mystery",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row13 <- movies_1 |>
filter(music == 1) |>
summarise(genre = "music",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row14 <- movies_1 |>
filter(musical == 1) |>
summarise(genre = "musical",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row15 <- movies_1 |>
filter(romance == 1) |>
summarise(genre = "romance",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row16 <- movies_1 |>
filter(scifi == 1) |>
summarise(genre = "scifi",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row17 <- movies_1 |>
filter(sport == 1) |>
summarise(genre = "sport",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row18 <- movies_1 |>
filter(thriller == 1) |>
summarise(genre = "thriller",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row19 <- movies_1 |>
filter(war == 1) |>
summarise(genre = "war",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

row20 <- movies_1 |>
filter(western == 1) |>
summarise(genre = "western",
n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
q25_stars = quantile(imdb_stars, 0.25),
q75_stars = quantile(imdb_stars, 0.75))

mov1_summary <- bind_rows(row01, row02, row03, row04,
row05, row06, row07, row08,
row09, row10, row11, row12,
row13, row14, row15, row16,
row17, row18, row19, row20)

mov1_summary |> arrange(desc(mean_stars)) |>
kable()

| genre | n | mean_stars | sd_stars | median_stars | mad_stars | q25_stars | q75_stars |
|---|---|---|---|---|---|---|---|
| war | 11 | 8.063636 | 0.4500505 | 8.10 | 0.59304 | 7.800 | 8.450 |
| biography | 17 | 7.929412 | 0.4194710 | 8.00 | 0.29652 | 7.800 | 8.200 |
| history | 6 | 7.916667 | 0.4355074 | 8.05 | 0.29652 | 7.850 | 8.175 |
| western | 2 | 7.850000 | 0.0707107 | 7.85 | 0.07413 | 7.825 | 7.875 |
| crime | 34 | 7.720588 | 0.9741444 | 7.85 | 0.81543 | 7.225 | 8.300 |
| animation | 29 | 7.703448 | 0.5784726 | 7.70 | 0.44478 | 7.600 | 8.000 |
| drama | 153 | 7.697386 | 0.8402653 | 7.80 | 0.59304 | 7.300 | 8.200 |
| thriller | 50 | 7.622000 | 0.7089198 | 7.70 | 0.74130 | 7.200 | 8.175 |
| adventure | 87 | 7.614942 | 0.7533690 | 7.70 | 0.74130 | 7.200 | 8.200 |
| scifi | 51 | 7.596078 | 0.8175477 | 7.70 | 0.88956 | 7.100 | 8.300 |
| mystery | 26 | 7.576923 | 0.6901059 | 7.70 | 0.59304 | 7.225 | 7.900 |
| fantasy | 60 | 7.560000 | 0.8793680 | 7.70 | 0.59304 | 7.200 | 8.000 |
| action | 60 | 7.508333 | 0.8415667 | 7.50 | 1.03782 | 6.875 | 8.200 |
| family | 44 | 7.422727 | 0.7426739 | 7.70 | 0.59304 | 7.075 | 7.900 |
| romance | 62 | 7.406452 | 0.6885173 | 7.45 | 0.66717 | 7.025 | 7.900 |
| music | 30 | 7.393333 | 0.8529597 | 7.55 | 0.81543 | 6.850 | 8.000 |
| sport | 6 | 7.366667 | 0.4501851 | 7.50 | 0.37065 | 7.175 | 7.675 |
| comedy | 103 | 7.250485 | 0.8209122 | 7.40 | 0.74130 | 6.900 | 7.800 |
| musical | 17 | 7.205882 | 0.6823403 | 7.30 | 0.59304 | 6.800 | 7.700 |
| horror | 12 | 6.966667 | 1.5417424 | 7.15 | 1.55673 | 6.450 | 8.175 |
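The twenty near-identical blocks above could be collapsed into a single pivot_longer() pipeline. Here is a sketch of that approach on a tiny made-up data frame with the same 0/1 indicator-column structure (toy data, not movies_1; only three genres shown):

```r
library(tidyverse)

## toy data with genre indicator columns, mimicking movies_1's structure
toy <- tibble(imdb_stars = c(7.5, 8.2, 6.9, 8.8),
              action     = c(1, 0, 1, 0),
              comedy     = c(0, 1, 1, 0),
              drama      = c(0, 1, 0, 1))

toy |>
  pivot_longer(cols = action:drama,
               names_to = "genre", values_to = "in_genre") |>
  filter(in_genre == 1) |>          # keep each movie once per genre it belongs to
  group_by(genre) |>
  summarise(n = n(),
            mean_stars = mean(imdb_stars),
            median_stars = median(imdb_stars)) |>
  arrange(desc(mean_stars))
```

As with the row01 through row20 approach, a movie tagged with several genres is counted once in each of those genres.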
7.2 What about overlap?
Of course, I’ve completely ignored the issue of movies with multiple genres here.
For example, suppose we’re interested in comedies and dramas, plus the movies that are both, and those that are neither. We’d need a summary like this.
movies_1 |> group_by(comedy, drama) |>
summarize(n = n(),
mean_stars = mean(imdb_stars),
sd_stars = sd(imdb_stars),
median_stars = median(imdb_stars),
mad_stars = mad(imdb_stars),
.groups = "keep") |>
kable(digits = 2)

| comedy | drama | n | mean_stars | sd_stars | median_stars | mad_stars |
|---|---|---|---|---|---|---|
| 0 | 0 | 54 | 7.64 | 0.77 | 7.8 | 0.74 |
| 0 | 1 | 103 | 7.84 | 0.83 | 8.0 | 0.59 |
| 1 | 0 | 53 | 7.10 | 0.82 | 7.1 | 0.89 |
| 1 | 1 | 50 | 7.41 | 0.79 | 7.6 | 0.59 |
7.3 A Cleveland dot plot
Here is a Cleveland dot plot of the summarized data on the individual genres. Cleveland dot plots are an alternative to bar graphs that reduce visual clutter and may be easier to read.
ggplot(data = mov1_summary, aes(x = mean_stars, y = genre)) +
geom_point()

Here is a fancier version, based on the recipe in Section 3.10 of the R Graphics Cookbook…
ggplot(data = mov1_summary,
aes(x = mean_stars, y = reorder(genre, mean_stars))) +
geom_point(size = 3) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey60", linetype = "dashed")
) +
labs(y = "Movie Genre", x = "Mean Star Rating on IMDB")

Here is a plot of the median, 25th and 75th percentiles of star ratings for each genre instead.
ggplot(data = mov1_summary,
aes(x = median_stars, y = reorder(genre, median_stars))) +
geom_pointrange(aes(xmin = q25_stars, xmax = q75_stars)) +
labs(y = "Movie Genre", x = "Median (and Q25, Q75) of Star Rating on IMDB")

Note that there are only two western movies in our data, so calculating the percentiles is a problem.
8 Session Information
session_info()

R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
abind_1.4-8 arrangements_1.1.9 askpass_1.2.1
backports_1.5.0 base_4.5.1 base64enc_0.1-3
bayesplot_1.14.0 bayestestR_0.17.0 BH_1.87.0.1
bit_4.6.0 bit64_4.6.0.1 blob_1.2.4
boot_1.3-32 broom_1.0.10 bslib_0.9.0
cachem_1.1.0 callr_3.7.6 car_3.1-3
carData_3.0-5 cellranger_1.1.0 checkmate_2.3.3
class_7.3-23 cli_3.6.5 clipr_0.8.0
coda_0.19-4.1 codetools_0.2-20 collapse_2.1.3
colourpicker_1.3.0 commonmark_2.0.0 compiler_4.5.1
conflicted_1.2.0 correlation_0.8.8 cowplot_1.2.0
cpp11_0.5.2 crayon_1.5.3 crosstalk_1.2.2
curl_7.0.0 data.table_1.17.8 datasets_4.5.1
datawizard_1.2.0 DBI_1.2.3 dbplyr_2.5.1
Deriv_4.2.0 desc_1.4.3 DescTools_0.99.60
digest_0.6.37 distributional_0.5.0 doBy_4.7.0
dplyr_1.1.4 DT_0.34.0 dtplyr_1.3.2
dygraphs_1.1.1.6 e1071_1.7-16 easystats_0.7.5
effectsize_1.0.1 emmeans_1.11.2-8 estimability_1.5.1
evaluate_1.0.5 Exact_3.3 exactRankTests_0.8-35
expm_1.0-0 farver_2.1.2 fastmap_1.2.0
fontawesome_0.5.3 forcats_1.0.1 foreach_1.5.2
Formula_1.2-5 fs_1.6.6 gargle_1.6.0
generics_0.1.4 ggplot2_4.0.0 ggrepel_0.9.6
ggridges_0.5.7 gld_2.6.8 glmnet_4.1-10
glue_1.8.0 gmp_0.7-5 googledrive_2.1.2
googlesheets4_1.1.2 graphics_4.5.1 grDevices_4.5.1
grid_4.5.1 gridExtra_2.3 gtable_0.3.6
gtools_3.9.5 haven_2.5.5 highr_0.11
hms_1.1.3 htmltools_0.5.8.1 htmlwidgets_1.6.4
httpuv_1.6.16 httr_1.4.7 ids_1.0.1
igraph_2.1.4 infer_1.0.9 inline_0.3.21
insight_1.4.2 isoband_0.2.7 iterators_1.0.14
janitor_2.2.1 jomo_2.7-6 jquerylib_0.1.4
jsonlite_2.0.0 kableExtra_1.4.0 knitr_1.50
labeling_0.4.3 later_1.4.4 lattice_0.22-7
lazyeval_0.2.2 lifecycle_1.0.4 litedown_0.7
lme4_1.1-37 lmom_3.2 loo_2.8.0
lubridate_1.9.4 magrittr_2.0.4 marginaleffects_0.30.0
markdown_2.0 MASS_7.3-65 Matrix_1.7-4
MatrixModels_0.5.4 matrixStats_1.5.0 memoise_2.0.1
methods_4.5.1 mgcv_1.9-3 mice_3.18.0
miceadds_3.18-36 microbenchmark_1.5.0 mime_0.13
miniUI_0.1.2 minqa_1.2.8 mitml_0.4-5
mitools_2.4 MKdescr_0.9 MKinfer_1.2
modelbased_0.13.0 modelr_0.1.11 multcomp_1.4-28
mvtnorm_1.3-3 naniar_1.1.0 nlme_3.1-168
nloptr_2.2.1 nnet_7.3-20 norm_1.0.11.1
numDeriv_2016.8.1.1 openssl_2.3.4 ordinal_2023.12.4.1
pan_1.9 parallel_4.5.1 parameters_0.28.2
patchwork_1.3.2 pbkrtest_0.5.5 performance_0.15.1
pillar_1.11.1 pkgbuild_1.4.8 pkgconfig_2.0.3
plyr_1.8.9 posterior_1.6.1 prettyunits_1.2.0
processx_3.8.6 progress_1.2.3 promises_1.3.3
proxy_0.4-27 ps_1.9.1 purrr_1.1.0
quantreg_6.1 QuickJSR_1.8.1 R6_2.6.1
ragg_1.5.0 rappdirs_0.3.3 rbibutils_2.3
RColorBrewer_1.1-3 Rcpp_1.1.0 RcppArmadillo_15.0.2.2
RcppEigen_0.3.4.0.2 RcppParallel_5.1.11-1 Rdpack_2.6.4
readr_2.1.5 readxl_1.4.5 reformulas_0.4.1
rematch_2.0.0 rematch2_2.1.2 report_0.6.1
reprex_2.1.1 reshape2_1.4.4 rlang_1.1.6
rmarkdown_2.30 rootSolve_1.8.2.4 rpart_4.1.24
rstan_2.32.7 rstanarm_2.32.2 rstantools_2.5.0
rstudioapi_0.17.1 rvest_1.0.5 S7_0.2.0
sandwich_3.1-1 sass_0.4.10 scales_1.4.0
see_0.12.0 selectr_0.4.2 shape_1.4.6.1
shiny_1.11.1 shinyjs_2.1.0 shinystan_2.6.0
shinythemes_1.2.0 snakecase_0.11.1 sourcetools_0.1.7.1
SparseM_1.84.2 splines_4.5.1 StanHeaders_2.32.10
stats_4.5.1 stats4_4.5.1 stringi_1.8.7
stringr_1.5.2 survival_3.8-3 svglite_2.2.1
sys_3.4.3 systemfonts_1.3.0 tensorA_0.36.2.1
textshaping_1.0.3 TH.data_1.1-4 threejs_0.3.4
tibble_3.3.0 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.57
tools_4.5.1 tzdb_0.5.0 ucminf_1.2.2
UpSetR_1.4.0 utf8_1.2.6 utils_4.5.1
uuid_1.2.1 V8_8.0.0 vctrs_0.6.5
viridis_0.6.5 viridisLite_0.4.2 visdat_0.6.0
vroom_1.6.6 withr_3.0.2 xfun_0.53
xml2_1.4.0 xtable_1.8-4 xts_0.14.1
yaml_2.3.10 zoo_1.8-14